class: center, middle, inverse, title-slide

# APSTA-GE 2003: Intermediate Quantitative Methods
## Lab Section 003, Week 5
### New York University
### 10/06/2020

---

## Reminders

- Assignment 3
  - Due: **10/19/2020 11:55pm (EST)**
- Office hours
  - Monday 9 - 10am (EST)
  - Wednesday 12:30 - 1:30pm (EST)
- Office hour Zoom link
  - https://nyu.zoom.us/j/97347070628 (pin: 2003)
- Office hour notes
  - Available on NYU Classes under the "Resources" tab

---

## Today's Topics

- Quick Review
  - Math
  - R Code
- Review class exercise
  - Using a different dataset

---

class: inverse, center, middle

# Quick Review
## Summary Statistics

---

## Median

**The value at the center of the sorted data**

- For odd `\(n\)`: `$$X_{\frac{n + 1}{2}}$$`
- For even `\(n\)`: `$$\frac{X_{\frac{n}{2}} + X_{\frac{n}{2} + 1}}{2}$$`

`\(n\)`: Sample Size, `\(X_k\)`: the `\(k\)`-th value after sorting in ascending order

```r
# Simulated data
dat <- data.frame(x = c(rep(1:3, each = 2), rep(2:8, each = 2)))
*median(dat$x)
```

```
## [1] 3.5
```

---

## Mean

**The average value**

**Population Mean:**

$$
\mu = \frac{\sum_{i=1}^{N} X_i}{N}
$$

**Sample Mean:**

$$
\overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}
$$

`\(N\)`: Population Size, `\(n\)`: Sample Size

```r
*mean(dat$x)
```

```
## [1] 4.1
```

---

## Variance

**Average squared distance from the mean**

**Population Variance:**

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
$$

**Sample Variance:**

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2
$$

`\(N\)`: Population Size, `\(n\)`: Sample Size

```r
*var(dat$x)
```

```
## [1] 5.147368
```

---

## Standard Deviation (S.D.)

**Typical distance from the mean (the square root of the variance)**

**Population SD:**

$$
\sigma = \sqrt {\text{Population Variance}}
$$

**Sample SD:**

$$
s = \sqrt {\text{Sample Variance}}
$$

```r
*sd(dat$x)
```

```
## [1] 2.268781
```

---

## Standard Error (S.E.)

**Typical distance of a sample mean from the population mean**

**Error: individual-level variation (noise)**

In this case, the error is how much the sample mean varies from one sample to the next.

$$
SE = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}
$$

For a fixed `\(n\)`, a smaller SD gives a smaller SE, and a larger SD gives a larger SE:

$$
\begin{aligned}
SE \downarrow &= \frac{SD \downarrow}{\sqrt{n}} \\
SE \uparrow &= \frac{SD \uparrow}{\sqrt{n}}
\end{aligned}
$$

`\(n\)`: Sample size

```r
sqrt(var(dat$x)/nrow(dat))
```

```
## [1] 0.5073149
```

```r
sd(dat$x)/sqrt(nrow(dat))
```

```
## [1] 0.5073149
```

---

## Confidence Interval

**Interval estimate of the population mean**

With 95% confidence, we estimate that the population mean lies within:

$$
(\overline{x} - 1.96 \times SE, \ \overline{x} + 1.96 \times SE)
$$

## Prediction vs. Confidence

Prediction: individual-level estimation

Confidence: average-level estimation

---

## Pooled Variance

**Weighted average of the variances of two samples, weighted by their degrees of freedom**

$$
s^2_{pooled} = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 - 1 + n_2 - 1}
$$

where `\(s_1^2\)` and `\(s_2^2\)` are the two sample variances and `\(n_1\)` and `\(n_2\)` are the two sample sizes.

## Degrees of Freedom (d.f.)

**The number of values that are free to vary once the quantities estimated from the data are fixed**

---

## T-test and ANOVA

T-test: examine the mean difference between two samples, or between a sample and a hypothesized population mean.

ANOVA: examine mean differences across two or more groups by comparing between-group variance with within-group variance.

---

## Linear Regression

$$
Y_i = \beta_0 + \beta_1 \times X_i + \varepsilon_i
$$

where `\(\beta_0\)` is the intercept, `\(\beta_1\)` is the slope, and `\(\varepsilon_i\)` is the error.

$$
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
$$

The **regression line** is the line that minimizes the sum of the **squared distances/errors** between each `\(Y_i\)` and the line.

---

## Interpreting Linear Regression Model

```r
lm_demo <- lm(mpgCity ~ weight, data = cars)
summary(lm_demo)
```

```
## 
## Call:
## lm(formula = mpgCity ~ weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3580 -1.2233 -0.5002  0.8783 12.6136 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.1430417  2.0855429   24.04   <2e-16 ***
## weight      -0.0088326  0.0006713  -13.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
## 
## Residual standard error: 3.214 on 52 degrees of freedom
## Multiple R-squared:  0.769,	Adjusted R-squared:  0.7645 
## F-statistic: 173.1 on 1 and 52 DF,  p-value: < 2.2e-16
```

---

## TSS, MSS, and RSS

```r
anova(lm_demo)
```

```
## Analysis of Variance Table
## 
## Response: mpgCity
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## weight     1 1788.39 1788.39  173.09 < 2.2e-16 ***
## Residuals 52  537.26   10.33                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In this table, the `weight` row gives the model sum of squares (MSS = 1788.39) and the `Residuals` row gives the residual sum of squares (RSS = 537.26), so TSS = 1788.39 + 537.26 = 2325.65.

---

class: inverse, middle, center

# Review class exercise

---

**Question 0**

Load the dataset, `cars`, and add a new column labelling the row ID.

```r
*dat <- cars
nrow(dat)
```

```
## [1] 54
```

```r
ncol(dat)
```

```
## [1] 6
```

```r
# Add a new column labelling the row ID
```

---

**Question 1**

Research question: **how does a car's fuel economy change as weight increases?**

Get a sense of these two columns by drawing a scatter plot, and report the sample size.

```r
*dat <- cars
View(dat)

# Create a scatter plot showing the distribution of mpgCity and weight
# Hint: plot()

# Report sample size
```

---

**Question 2**

Research question: **how does a car's fuel economy change as weight increases?**

Conduct a simple linear regression.

```r
# Linear model
```

---

**Question 3**

What's the regression coefficient of `weight` on `mpgCity`? Report the appropriate regression coefficient and its standard error.

```r
# Linear model
```

---

**Question 4**

What are the null and alternative hypotheses tested here?

---

**Question 5**

Is this coefficient statistically significant? What test do you use? Write down the null and alternative hypotheses, and report the test statistic and p-value.

---

**Question 6**

Report a 95% confidence interval for the regression coefficient of `weight` on `mpgCity` based on the results from this model.

---

**Question 7**

Based on the model, calculate the fitted values and residuals for all observations in the model.
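
As a rough illustration of what Question 7 asks for, the sketch below extracts fitted values and residuals from a fitted model. It is not the exercise solution: the lab's `cars` data (with `mpgCity` and `weight`) is not bundled with R, so the built-in `mtcars` data (`mpg`, `wt`) is used here as an assumed stand-in, and the object names `fit`, `fitted_vals`, and `resid_vals` are made up for this example.

```r
# Stand-in data (assumption): built-in mtcars replaces the lab's cars data
fit <- lm(mpg ~ wt, data = mtcars)

fitted_vals <- fitted(fit)  # predicted mpg for each observation
resid_vals  <- resid(fit)   # observed mpg minus predicted mpg

# Each observed value splits into fitted value plus residual
all.equal(unname(fitted_vals + resid_vals), mtcars$mpg)

# Side-by-side view of the first few observations
head(data.frame(observed = mtcars$mpg,
                fitted   = fitted_vals,
                residual = resid_vals))
```

The same two calls, `fitted()` and `resid()`, should work on any `lm` object, so swapping in the lab's model only requires changing the formula and the data.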

---

**Question 8**

What are the mean values of the fitted values and residuals?

---

**Question 9**

Calculate or report the total sum of squares, model sum of squares, and residual sum of squares. Verify that `\(TSS=MSS+RSS\)`.

---

**Question 10**

What's the R-squared of this model?

---

**Question 11**

What's the correlation between `mpgCity` and `weight`? Verify that the R-squared is the correlation squared.

---

**Question 12**

Estimate standardized regression coefficients using two different approaches:

a. Generate a new variable for `mpgCity` by dividing the original value by its standard deviation; generate a new variable for `weight` by dividing its original value by its standard deviation. Run a regression using these two new variables and report the regression coefficients (intercept and slope).

b. Use `lm.beta` to estimate the standardized regression coefficients.

c. Compare the standardized slope with the correlation.

d. What is the meaning of the intercept now?

---

## Contact

Tong Jin

- Email: tj1061@nyu.edu
- Office Hours
  - Mondays, 9 - 10am (EST)
  - Wednesdays, 12:30 - 1:30pm (EST)